Credit Card Customer Segmentation with Unsupervised Learning

I. Business Understanding

Background and Problem Statement

Competition in the credit card business is intense. Customers can easily switch to another credit card that has lower overall fees. According to How to Retain a Credit Card Customer, it costs about $80 to acquire a new credit card customer, who returns about $120 a year in profit to the company only if they keep the card. If the customer drops the card after a few weeks or doesn't use it, the company loses the customer acquisition cost (CAC) plus additional money spent trying to reactivate them. In addition, Financial Publishing Services states in its research that acquiring a new customer is estimated to cost five times as much as retaining an existing one. This clearly shows that every credit card issuer must make its best effort to retain its customers. The right retention strategy can increase the company's chances of retaining its customers and reduce the estimated loss that the company must cover.

Customer loyalty is one of the key factors in retaining customers. Why is customer loyalty important? Repeat customers spend more than first-time customers, loyal customers produce higher conversion rates, loyalty boosts profits, retaining an existing customer is cheaper than acquiring a new one, customer loyalty helps in effective planning, loyal customers shop regularly, and repeat customers spend more during the holidays. We can increase the loyalty of our customers by rewarding loyal customers with a loyalty program, making customer care a priority for the brand, boosting customer experience by introducing VIP tiers, segmenting clients and personalizing their experience, sending event-based emails, optimizing the business's referral program, and encouraging customers to give feedback and acting on it. We can measure customer loyalty based on lifetime value, churn rate, referrals, and net promoter score.

PWDK Bank is one of the credit card issuers in the USA. Currently, the company has only one type of credit card. To serve customers better, the company plans to release new types of credit cards based on customers' needs. In this project, we position ourselves as part of the Data Science team at PWDK Bank. We were assigned to the marketing division to segment credit card users based on their credit card usage over the last 6 months.

Business Objectives

The business objectives that we want to achieve through this project are as follows:

Data Availability

The data owned by the company is limited to credit card users only. We will use information about their credit card usage behavior, such as credit limit, balance, purchases, and payments, from the last six months.

Analytic Approach

Because the data has no labels, we address this problem with unsupervised learning. The cleaned dataset will be fed to an unsupervised learning model, which will generate the customer segments. The quality of the resulting clusters will be evaluated using clustering evaluation metrics such as the silhouette score and the within-cluster sum of squares (WCSS).

Incorrect segmentation can lead to two possible losses for the bank:

II. Data Understanding

We used the Credit Card Dataset for Clustering from Kaggle. The dataset summarizes the usage behavior of 8950 credit card customers during the last 6 months. The file is at the customer level with 18 behavioral variables. The following is a display of the first five rows of the dataset. Below is the definition of each feature:

III. Exploratory Data Analysis

Data Distribution Plot

We check the data distribution of each feature using descriptive statistics, histograms, and boxplots.
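As a minimal sketch of the descriptive-statistics part of this check, assuming the data is held in a pandas DataFrame: the helper name `summarize_distribution` and the synthetic columns below are illustrative only and do not come from the project code.

```python
import numpy as np
import pandas as pd

def summarize_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-feature descriptive statistics plus skewness."""
    summary = df.describe().T        # count, mean, std, min, quartiles, max
    summary["skew"] = df.skew()      # positive values indicate a long right tail
    return summary

# Synthetic stand-in data (not the real dataset): right-skewed amounts.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "BALANCE": rng.lognormal(mean=7.0, sigma=1.0, size=500),
    "PURCHASES": rng.lognormal(mean=6.0, sigma=1.2, size=500),
})
summary = summarize_distribution(demo)
print(summary[["mean", "50%", "skew"]])
```

With matplotlib installed, `demo.hist()` and `demo.boxplot()` would give the corresponding histogram and boxplot views.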

Data Correlation

We calculate Spearman's correlation coefficient between features and show the result on a heatmap.
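The correlation step might look like the following sketch, assuming a pandas DataFrame. The data is synthetic, and the `PURCHASES_SQUARED` column is a made-up feature that exists only to show how Spearman's coefficient differs from Pearson's on a monotonic but non-linear relationship.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(1)
purchases = rng.lognormal(mean=6.0, sigma=1.0, size=300)
demo = pd.DataFrame({
    "PURCHASES": purchases,
    # Hypothetical column: a monotonic, non-linear function of PURCHASES.
    "PURCHASES_SQUARED": purchases ** 2,
    "CREDIT_LIMIT": rng.uniform(500, 10000, size=300),
})

# Spearman's coefficient is the Pearson correlation of the ranks,
# so it captures monotonic (not only linear) relationships.
spearman = demo.corr(method="spearman")
pearson = demo.corr(method="pearson")
print(spearman.round(2))
```

If seaborn is available (an assumption), `seaborn.heatmap(spearman, annot=True)` is one way to render the matrix as a heatmap.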

Identify Missing Values, Duplicates, and Outliers

IV. Data Preprocessing

Remove Unnecessary Features

Based on the results of data understanding and exploratory data analysis, we decided not to use CUST_ID as input to the machine learning clustering model. CUST_ID has a unique value for each entry and does not provide any useful information for customer segmentation.

Missing Values Handling

There are 313 missing values in MINIMUM_PAYMENTS and 1 missing value in CREDIT_LIMIT.
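A small sketch of how such counts can be obtained with pandas, using a toy frame rather than the real dataset. The median fill at the end is only one possible treatment, shown as an assumption; it is not necessarily the handling used in this project.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (not the real dataset) with gaps in the same two columns.
demo = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, 2500.0, np.nan, 7000.0, 1200.0],
    "MINIMUM_PAYMENTS": [140.0, np.nan, 620.0, np.nan, 90.0],
})

# Count missing values per feature.
missing_counts = demo.isna().sum()
print(missing_counts)

# One possible treatment (an assumption, not necessarily the one used here):
# fill each gap with the column median, which is robust to skew and outliers.
imputed = demo.fillna(demo.median())
print(imputed)
```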

V. Customer Segmentation Modeling and Analysis

Clustering Algorithms

For the modeling process, we used 3 clustering algorithms as below:

Evaluation Metrics

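The main evaluation metrics that we used are the silhouette score and WCSS (within-cluster sum of squares). The silhouette score can be applied to evaluate any clustering result, whereas we use WCSS only to evaluate K-Means results. Silhouette score values range from -1 to 1: a value near 1 means a sample is well matched to its own cluster and far from neighboring clusters, a value near 0 means it lies on or near the boundary between two clusters, and a value near -1 suggests it may have been assigned to the wrong cluster.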
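For a single sample, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The sketch below assumes scikit-learn and synthetic data; it computes WCSS by hand, compares it with the `inertia_` attribute of `KMeans`, and reports the mean silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated blobs (not the real dataset).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# WCSS: sum of squared distances from each point to its assigned centroid.
wcss = float(((X - kmeans.cluster_centers_[labels]) ** 2).sum())

# scikit-learn exposes the same quantity as `inertia_`.
print(f"WCSS (manual)  : {wcss:.3f}")
print(f"WCSS (inertia_): {kmeans.inertia_:.3f}")

# Mean silhouette coefficient over all samples, in [-1, 1].
sil = silhouette_score(X, labels)
print(f"Silhouette     : {sil:.3f}")
```

WCSS depends on centroids, which is why it is a natural fit for K-Means, while the silhouette score only needs the points and their labels.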

Predefined Functions

Before the modeling process, we create several functions for displaying cluster evaluation results and generating plots.

List of cluster evaluation functions

  1. kmeans_cluster
  2. kmeans_eval
  3. agglo_cluster
  4. agglo_eval
  5. gmm_cluster
  6. gmm_eval

List of plot functions

  1. plot_elbow
  2. plot_silhouette
  3. plot_scatter
  4. plot_scatter_3d
  5. plot_violin
  6. plot_dendogram
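The bodies of these helpers are not shown in this document. As a purely hypothetical illustration of what an evaluation helper in the spirit of `kmeans_eval` could do, the sketch below (named `kmeans_eval_sketch` to make clear it is not the original function) fits K-Means over a range of cluster counts with scikit-learn and collects WCSS, silhouette score, and cluster sizes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def kmeans_eval_sketch(X, k_values, random_state=0):
    """Fit K-Means for each k and collect WCSS, silhouette, and cluster sizes.

    Hypothetical stand-in for a helper such as `kmeans_eval`; the real
    function's signature and output are not shown in this document.
    """
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        rows.append({
            "k": k,
            "wcss": model.inertia_,
            "silhouette": silhouette_score(X, model.labels_),
            "cluster_sizes": np.bincount(model.labels_, minlength=k).tolist(),
        })
    return rows

# Synthetic data with four blobs (not the real dataset).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=7)
results = kmeans_eval_sketch(X, k_values=range(2, 7))
for row in results:
    print(row)
```

Plotting `wcss` against `k` from these rows would give an elbow-style chart, and plotting `silhouette` against `k` would give the silhouette comparison.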

Features Selection

We want to build a model with high interpretability, so we use feature selection to reduce model complexity. The selected features should have a significant impact on the cluster results. Feature selection followed this procedure:

  1. Cluster the dataset using all features and evaluate the clustering results for each clustering algorithm
  2. Select the optimal number of clusters for each clustering algorithm
  3. Generate a violin plot for each feature with the selected number of clusters and visually assess which features are potentially significant
  4. Create a new dataset by adding the clustering result to the initial dataset
  5. Split the dataset into train and test sets, using the clustering result column as the target (label)
  6. Train a classification model that has a feature_importances attribute (decision tree or random forest) on the train set
  7. Predict the test set labels with the selected classification model and make sure it achieves a good evaluation score
  8. Extract the feature_importances from the model and analyze them for each clustering algorithm
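The procedure above can be sketched roughly as follows, assuming scikit-learn, where the attribute is spelled `feature_importances_`. The data is synthetic, the `NOISE` column is a made-up uninformative feature, and the violin-plot step is omitted, so this is an illustration of the idea rather than the project's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset): two informative features
# that form three groups, plus one pure-noise feature.
rng = np.random.default_rng(0)
group = rng.integers(0, 3, size=600)
demo = pd.DataFrame({
    "BALANCE": group * 5.0 + rng.normal(0, 0.5, size=600),
    "CREDIT_LIMIT": group * 4.0 + rng.normal(0, 0.5, size=600),
    "NOISE": rng.normal(0, 0.5, size=600),
})

# Steps 1-4: cluster on all features and attach the labels to the dataset.
demo_labeled = demo.copy()
demo_labeled["CLUSTER"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(demo)

# Step 5: split, using the cluster labels as the classification target.
X_train, X_test, y_train, y_test = train_test_split(
    demo_labeled[demo.columns], demo_labeled["CLUSTER"],
    test_size=0.2, stratify=demo_labeled["CLUSTER"], random_state=0,
)

# Steps 6-7: train a classifier and check that it reproduces the clusters.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# Step 8: rank features by how much they contribute to separating clusters.
importances = pd.Series(clf.feature_importances_, index=demo.columns).sort_values(ascending=False)
print(f"test accuracy: {accuracy:.3f}")
print(importances)
```

A high test accuracy matters here because feature importances are only meaningful if the classifier actually reproduces the cluster assignments.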

K-Means Initial Clustering Performance

Based on the silhouette scores below, we found the following insights:

  1. K-Means has a relatively good and stable silhouette score across all numbers of clusters
  2. The WCSS also decreases proportionally with each additional cluster

Agglomerative Initial Clustering Performance

Based on the silhouette scores below, we found the following insights:

  1. Single Linkage, Complete Linkage, and Average Linkage have relatively high silhouette scores, but the distribution of cluster membership they generate is extremely unbalanced. There is always 1 cluster that contains around 8940 customers, with the rest distributed among the other clusters.
  2. We conclude that Ward Linkage is the best of the 4 linkages. It has a more balanced distribution of cluster membership, although it produces the worst silhouette score.
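The contrast between linkages can be reproduced on a toy example. The sketch below assumes scikit-learn's `AgglomerativeClustering` and uses synthetic points, not the real dataset: two groups plus two isolated extreme points. On data like this, single linkage tends to split off an extreme point as its own cluster, while Ward linkage tends to split the bulk of the data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in data (not the real dataset): two groups of 250 points
# plus two isolated extreme points, loosely mimicking a few extreme customers.
rng = np.random.default_rng(3)
group_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(250, 2))
group_b = rng.normal(loc=[8.0, 0.0], scale=1.0, size=(250, 2))
extremes = np.array([[-12.0, 0.0], [20.0, 0.0]])
X = np.vstack([group_a, group_b, extremes])

cluster_sizes = {}
for linkage in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    # Sorted membership counts make an unbalanced split easy to spot.
    cluster_sizes[linkage] = sorted(np.bincount(labels).tolist(), reverse=True)
    print(f"{linkage:>8}: {cluster_sizes[linkage]}")
```

Checking membership counts alongside the silhouette score helps catch cases where a high score comes from a few tiny clusters rather than from a useful segmentation.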

GMM Initial Clustering Performance

Based on the silhouette scores below, we found the following insights:

  1. Silhouette scores are relatively low across all covariance types, but cluster membership is fairly evenly distributed across clusters.
  2. Tied Covariance GMM produces the best silhouette score.

Based on the violin plots above, we can see that a few variables affect the cluster results:

  1. BALANCE
  2. PURCHASES (the ONEOFF and INSTALLMENTS features are already covered by PURCHASES)
  3. CASH_ADVANCE
  4. CREDIT_LIMIT
  5. PAYMENTS
  6. MINIMUM_PAYMENTS

Based on the feature importances from a Random Forest Classifier trained on the results of the 3 clustering algorithms (K-Means, Agglomerative, GMM), the top 5 features are:

  1. K-Means : CREDIT_LIMIT, BALANCE, PAYMENTS, MINIMUM_PAYMENTS, CASH_ADVANCE
  2. Agglomerative : CREDIT_LIMIT, BALANCE, PAYMENTS, PURCHASES, ONEOFF
  3. GMM : BALANCE, MINIMUM_PAYMENTS, CREDIT_LIMIT, CASH_ADVANCE, PURCHASES

Based on the feature importances above, we choose 3 features:

  1. BALANCE
  2. CREDIT_LIMIT
  3. PAYMENTS

Cluster Modeling

K-Means Algorithm

Agglomerative Clustering

  1. Single Linkage Clustering
  2. Complete Linkage Clustering
  3. Average Linkage Clustering
  4. Ward Linkage Clustering

Gaussian Mixture Models (GMM)

GMM has different covariance type options:
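Assuming scikit-learn's `GaussianMixture` is the implementation in use, its `covariance_type` parameter accepts four values: `full` (each component has its own general covariance matrix), `tied` (all components share one covariance matrix), `diag` (each component has its own diagonal covariance matrix), and `spherical` (each component has a single variance). A minimal sketch on synthetic data that compares them:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (not the real dataset).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

gmm_results = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    labels = gmm.predict(X)  # hard assignment: most probable component per sample
    gmm_results[cov_type] = {
        "silhouette": silhouette_score(X, labels),
        "cluster_sizes": np.bincount(labels, minlength=3).tolist(),
    }
    print(cov_type, gmm_results[cov_type])
```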

The following is a comparison of the silhouette scores for each algorithm across a range of cluster counts:

Cluster Analysis

Based on the silhouette scores and the visualization of the separation between clusters, we chose Ward Linkage as the final cluster model. To interpret the behavioral characteristics of each cluster, we made the following visualizations:

  1. Cluster visualizations in 2D and 3D
  2. Boxplots (excluding outliers)
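Alongside the plots, a per-cluster summary table is a common way to characterize clusters. The sketch below assumes scikit-learn and pandas and uses synthetic data with arbitrary group sizes, so its numbers say nothing about the real clusters. The use of `StandardScaler` is also an assumption, since the scaling applied in this project is not described here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

FEATURES = ["BALANCE", "CREDIT_LIMIT", "PAYMENTS"]

# Synthetic stand-in data (not the real dataset): a larger low-value group
# and a smaller high-value group, with arbitrary sizes.
rng = np.random.default_rng(5)
low = rng.normal(loc=[1000, 2000, 800], scale=[200, 300, 150], size=(350, 3))
high = rng.normal(loc=[6000, 12000, 5000], scale=[500, 800, 400], size=(150, 3))
demo = pd.DataFrame(np.vstack([low, high]), columns=FEATURES)

# Scaling is an assumption here: it keeps one large-valued feature from
# dominating the Euclidean distances that Ward linkage relies on.
X_scaled = StandardScaler().fit_transform(demo[FEATURES])
demo["CLUSTER"] = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_scaled)

# Share of customers per cluster and the median of each feature per cluster.
cluster_share = demo["CLUSTER"].value_counts(normalize=True).sort_index()
cluster_profile = demo.groupby("CLUSTER")[FEATURES].median()
print(cluster_share)
print(cluster_profile.round(0))
```

Medians are used instead of means so that a few extreme customers do not distort the profile of a cluster.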

From the visualizations and descriptive statistics above, we can conclude:

CLUSTER 0 (LOW SPENDERS)

This cluster consists of 80% of the credit card customers, who have LOW BALANCE, LOW PAYMENTS, and LOW CREDIT_LIMIT.

VI. Recommendations

Credit Card Product

PWDK Silver Credit Card

VII. Business Impact

Based on data from Statista, the average churn rate in 2020 for the financial and credit sector was 25%. According to Bain & Company, by migrating to the right customer relationships and improving the value proposition, customer segmentation can increase retention by 4% and profits by 11%. Other sources report retention increases in the range of 3-5%. With market segments in place, we assume that the customer retention rate at PWDK Bank will increase by 3%-5%, with each retained customer returning $120 in profit from using our credit card product.
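To make the assumption concrete, the arithmetic can be sketched as below. The function name `added_annual_profit` is illustrative, and using the 8,950 customers in the dataset as the customer base is an assumption made only for this example.

```python
def added_annual_profit(customers: int, retention_gain_pct: float, profit_per_customer: float) -> float:
    """Extra yearly profit from customers retained thanks to a retention gain."""
    extra_retained = customers * retention_gain_pct / 100
    return extra_retained * profit_per_customer

# Illustrative inputs: the 8,950 customers in the dataset stand in for the
# customer base (an assumption), with $120 yearly profit per retained customer.
low_estimate = added_annual_profit(8950, 3, 120)   # 3% retention gain
high_estimate = added_annual_profit(8950, 5, 120)  # 5% retention gain
print(f"${low_estimate:,.0f} to ${high_estimate:,.0f} per year")
# prints "$32,220 to $53,700 per year"
```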